TELECOM CUSTOMER CHURN PREDICTION
Introduction
Customer Churn refers to the phenomenon where customers stop using a company's products or services.
In the telecom industry, churn is a major concern because retaining existing customers is significantly more cost-effective than acquiring new ones.
Telecom companies generate massive amounts of data, including customer demographics, account information, service usage patterns, and contract details.
By analyzing this data, we can build predictive models to identify customers who are likely to churn, allowing companies to take proactive measures to retain them.
In this project, we aim to:
- Understand the key factors influencing customer churn
- Explore the dataset using Exploratory Data Analysis (EDA)
- Apply Machine Learning algorithms to predict churn
- Evaluate model performance and provide actionable insights for customer retention
The main objectives of this project are:
- Understand Customer Behavior: Analyze customer demographics, account details, and service usage to identify patterns that influence churn.
- Identify Key Churn Factors: Determine which factors most strongly contribute to customer churn, such as contract type, payment method, or usage frequency.
- Predict Customer Churn: Build and evaluate machine learning models (e.g., Random Forest, Logistic Regression, SVM) to predict which customers are likely to leave.
- Support Decision-Making: Provide actionable insights that help telecom companies implement strategies to retain customers and reduce churn.
- Evaluate Model Performance: Measure accuracy, precision, recall, and other metrics to select the most effective predictive model.
Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import warnings
warnings.filterwarnings('ignore')
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
Load the dataset
df=pd.read_csv("chrun.csv")
print(df)
customerID gender SeniorCitizen Partner Dependents tenure \
0 7590-VHVEG Female 0 Yes No 1
1 5575-GNVDE Male 0 No No 34
2 3668-QPYBK Male 0 No No 2
3 7795-CFOCW Male 0 No No 45
4 9237-HQITU Female 0 No No 2
... ... ... ... ... ... ...
7038 6840-RESVB Male 0 Yes Yes 24
7039 2234-XADUH Female 0 Yes Yes 72
7040 4801-JZAZL Female 0 Yes Yes 11
7041 8361-LTMKD Male 1 Yes No 4
7042 3186-AJIEK Male 0 No No 66
PhoneService MultipleLines InternetService OnlineSecurity ... \
0 No No phone service DSL No ...
1 Yes No DSL Yes ...
2 Yes No DSL Yes ...
3 No No phone service DSL Yes ...
4 Yes No Fiber optic No ...
... ... ... ... ... ...
7038 Yes Yes DSL Yes ...
7039 Yes Yes Fiber optic No ...
7040 No No phone service DSL Yes ...
7041 Yes Yes Fiber optic No ...
7042 Yes No Fiber optic Yes ...
DeviceProtection TechSupport StreamingTV StreamingMovies Contract \
0 No No No No Month-to-month
1 Yes No No No One year
2 No No No No Month-to-month
3 Yes Yes No No One year
4 No No No No Month-to-month
... ... ... ... ... ...
7038 Yes Yes Yes Yes One year
7039 Yes No Yes Yes One year
7040 No No No No Month-to-month
7041 No No No No Month-to-month
7042 Yes Yes Yes Yes Two year
PaperlessBilling PaymentMethod MonthlyCharges TotalCharges \
0 Yes Electronic check 29.85 29.85
1 No Mailed check 56.95 1889.5
2 Yes Mailed check 53.85 108.15
3 No Bank transfer (automatic) 42.30 1840.75
4 Yes Electronic check 70.70 151.65
... ... ... ... ...
7038 Yes Mailed check 84.80 1990.5
7039 Yes Credit card (automatic) 103.20 7362.9
7040 Yes Electronic check 29.60 346.45
7041 Yes Mailed check 74.40 306.6
7042 Yes Bank transfer (automatic) 105.65 6844.5
Churn
0 No
1 No
2 Yes
3 No
4 Yes
... ...
7038 No
7039 No
7040 No
7041 Yes
7042 No
[7043 rows x 21 columns]
df.head(10)
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| 5 | 9305-CDSKC | Female | 0 | No | No | 8 | Yes | Yes | Fiber optic | No | ... | Yes | No | Yes | Yes | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes |
| 6 | 1452-KIOVK | Male | 0 | No | Yes | 22 | Yes | Yes | Fiber optic | No | ... | No | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 89.10 | 1949.4 | No |
| 7 | 6713-OKOMC | Female | 0 | No | No | 10 | No | No phone service | DSL | Yes | ... | No | No | No | No | Month-to-month | No | Mailed check | 29.75 | 301.9 | No |
| 8 | 7892-POOKP | Female | 0 | Yes | No | 28 | Yes | Yes | Fiber optic | No | ... | Yes | Yes | Yes | Yes | Month-to-month | Yes | Electronic check | 104.80 | 3046.05 | Yes |
| 9 | 6388-TABGU | Male | 0 | No | Yes | 62 | Yes | No | DSL | Yes | ... | No | No | No | No | One year | No | Bank transfer (automatic) | 56.15 | 3487.95 | No |
10 rows × 21 columns
The data set includes information about:
- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, age range, and if they have partners and dependents
df.shape
(7043, 21)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   object
 1   gender            7043 non-null   object
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   object
 4   Dependents        7043 non-null   object
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   object
 7   MultipleLines     7043 non-null   object
 8   InternetService   7043 non-null   object
 9   OnlineSecurity    7043 non-null   object
 10  OnlineBackup      7043 non-null   object
 11  DeviceProtection  7043 non-null   object
 12  TechSupport       7043 non-null   object
 13  StreamingTV       7043 non-null   object
 14  StreamingMovies   7043 non-null   object
 15  Contract          7043 non-null   object
 16  PaperlessBilling  7043 non-null   object
 17  PaymentMethod     7043 non-null   object
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object
 20  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
df.columns.values
array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
'TotalCharges', 'Churn'], dtype=object)
df.dtypes
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object
Visualize the missing values
msno.matrix(df)
<Axes: >
Data Manipulation
df=df.drop(['customerID'],axis=1)
df
| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.5 | No |
| 7039 | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.9 | No |
| 7040 | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7041 | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.6 | Yes |
| 7042 | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.5 | No |
7043 rows × 20 columns
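A closer look reveals hidden missing values: TotalCharges is stored as text (dtype object), and some of its entries are blank strings rather than numbers. Converting the column to a numeric type with errors='coerce' turns those blanks into NaN so they can be counted.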
df['TotalCharges']=pd.to_numeric(df.TotalCharges,errors='coerce')
df.dtypes
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object
df.isnull().sum()
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
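To illustrate why the conversion surfaces missing values, here is a minimal sketch on a toy Series. The values in `raw` are made up for illustration and only stand in for the text-typed TotalCharges column; the names `raw`, `blank_count`, and `converted` are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for a numeric column stored as text, with one blank entry (assumed data).
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Count entries that are empty once surrounding whitespace is stripped.
blank_count = int(raw.str.strip().eq("").sum())

# errors="coerce" replaces anything that cannot be parsed as a number with NaN
# instead of raising an error.
converted = pd.to_numeric(raw, errors="coerce")

print(blank_count)                  # blank strings before conversion
print(int(converted.isna().sum()))  # NaN values after conversion
print(converted.dtype)
```

Counting the blanks before converting makes it possible to confirm that every NaN produced by the coercion comes from a blank entry and not from some other malformed value.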
msno.matrix(df)
<Axes: >
df[np.isnan(df['TotalCharges'])]
| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 488 | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | No | Yes | Yes | Yes | No | Two year | Yes | Bank transfer (automatic) | 52.55 | NaN | No |
| 753 | Male | 0 | No | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.25 | NaN | No |
| 936 | Female | 0 | Yes | Yes | 0 | Yes | No | DSL | Yes | Yes | Yes | No | Yes | Yes | Two year | No | Mailed check | 80.85 | NaN | No |
| 1082 | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.75 | NaN | No |
| 1340 | Female | 0 | Yes | Yes | 0 | No | No phone service | DSL | Yes | Yes | Yes | Yes | Yes | No | Two year | No | Credit card (automatic) | 56.05 | NaN | No |
| 3331 | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 19.85 | NaN | No |
| 3826 | Male | 0 | Yes | Yes | 0 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 25.35 | NaN | No |
| 4380 | Female | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Mailed check | 20.00 | NaN | No |
| 5218 | Male | 0 | Yes | Yes | 0 | Yes | No | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | One year | Yes | Mailed check | 19.70 | NaN | No |
| 6670 | Female | 0 | Yes | Yes | 0 | Yes | Yes | DSL | No | Yes | Yes | Yes | Yes | No | Two year | No | Mailed check | 73.35 | NaN | No |
| 6754 | Male | 0 | No | Yes | 0 | Yes | Yes | DSL | Yes | Yes | No | Yes | No | No | Two year | Yes | Bank transfer (automatic) | 61.90 | NaN | No |
Note that tenure is 0 for all of these rows even though MonthlyCharges is populated. Let's check whether any other rows have tenure equal to 0.
df[df['tenure']==0].index
Index([488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754], dtype='int64')
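The comparison between the zero-tenure rows and the rows with missing TotalCharges can also be done programmatically instead of by reading the two index lists. The sketch below uses a small made-up frame named `toy` (an assumption, not the notebook's data); on the real data the same two masks would be built from `df`.

```python
import numpy as np
import pandas as pd

# Toy frame (assumed data): two rows have tenure 0 and a missing TotalCharges.
toy = pd.DataFrame({
    "tenure": [0, 12, 0, 5],
    "TotalCharges": [np.nan, 350.0, np.nan, 99.5],
})

zero_tenure = toy["tenure"] == 0          # rows with no completed billing period
missing_total = toy["TotalCharges"].isna()  # rows where TotalCharges is NaN

# True only if both masks select exactly the same rows.
same_rows = bool((zero_tenure == missing_total).all())
print(same_rows)
```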
No other rows have tenure equal to 0: the 11 zero-tenure rows are exactly the 11 rows with missing TotalCharges. Since they make up only 11 of 7,043 rows, we drop them; the effect on the analysis should be negligible.
df.drop(labels=df[df['tenure']==0].index,axis=0,inplace=True)
df
| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No |
| 7039 | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No |
| 7040 | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7041 | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes |
| 7042 | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No |
7032 rows × 20 columns
df.skew(numeric_only=True)
SeniorCitizen     1.831103
tenure            0.237731
MonthlyCharges   -0.222103
TotalCharges      0.961642
dtype: float64
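The 11 rows with missing TotalCharges were the zero-tenure rows dropped above, so no missing values remain at this point. As a safeguard, I fill any remaining missing TotalCharges values with the column mean. The result is assigned back to the column, because fillna returns a new object and does not modify df in place.
df['TotalCharges']=df['TotalCharges'].fillna(df['TotalCharges'].mean())
df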
| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | 0 | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No |
| 7039 | Female | 0 | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No |
| 7040 | Female | 0 | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7041 | Male | 1 | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes |
| 7042 | Male | 0 | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No |
7032 rows × 20 columns
df.isnull().sum()
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
df['SeniorCitizen']=df['SeniorCitizen'].map({0:'No',1:'Yes'})
df
| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | Male | No | Yes | Yes | 24 | Yes | Yes | DSL | Yes | No | Yes | Yes | Yes | Yes | One year | Yes | Mailed check | 84.80 | 1990.50 | No |
| 7039 | Female | No | Yes | Yes | 72 | Yes | Yes | Fiber optic | No | Yes | Yes | No | Yes | Yes | One year | Yes | Credit card (automatic) | 103.20 | 7362.90 | No |
| 7040 | Female | No | Yes | Yes | 11 | No | No phone service | DSL | Yes | No | No | No | No | No | Month-to-month | Yes | Electronic check | 29.60 | 346.45 | No |
| 7041 | Male | Yes | Yes | No | 4 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Mailed check | 74.40 | 306.60 | Yes |
| 7042 | Male | No | No | No | 66 | Yes | No | Fiber optic | Yes | No | Yes | Yes | Yes | Yes | Two year | Yes | Bank transfer (automatic) | 105.65 | 6844.50 | No |
7032 rows × 20 columns
df.nunique()
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                72
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1584
TotalCharges        6530
Churn                  2
dtype: int64
df['InternetService'].describe()
count            7032
unique              3
top       Fiber optic
freq             3096
Name: InternetService, dtype: object
df.dtypes
gender               object
SeniorCitizen        object
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object
num=['tenure','MonthlyCharges','TotalCharges']
df[num].describe()
| | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|
| count | 7032.000000 | 7032.000000 | 7032.000000 |
| mean | 32.421786 | 64.798208 | 2283.300441 |
| std | 24.545260 | 30.085974 | 2266.771362 |
| min | 1.000000 | 18.250000 | 18.800000 |
| 25% | 9.000000 | 35.587500 | 401.450000 |
| 50% | 29.000000 | 70.350000 | 1397.475000 |
| 75% | 55.000000 | 89.862500 | 3794.737500 |
| max | 72.000000 | 118.750000 | 8684.800000 |
Data Visualization
gender_counts=df['gender'].value_counts()
churn_counts=df['Churn'].value_counts()
fig=make_subplots(rows=1,cols=2,specs=[[{'type':'domain'},{'type':'domain'}]])
# Take the labels from value_counts() itself so labels and values cannot get out of order
fig.add_trace(go.Pie(labels=gender_counts.index, values=gender_counts.values, name="Gender"),1,1)
fig.add_trace(go.Pie(labels=churn_counts.index, values=churn_counts.values,name='Churn'),1,2)
fig.update_traces(hole=.4,hoverinfo="label+percent+name",textfont_size=16)
fig.update_layout(
title_text="<b>Gender and Churn Distribution</b>",
annotations=[dict(text='Gender',x=0.19,y=0.5,font_size=16,showarrow=False),
dict(text="Churn",x=0.8,y=0.5,font_size=16,showarrow=False)])
fig.show()
26.6% of customers churned (1,869 of 7,032). The pie charts show an almost equal split between male and female customers, and churned customers are a clear minority compared with those who stayed. These overall distributions alone do not show whether gender affects churn, so the counts are broken down by gender next.
df["Churn"][df["Churn"]=='No'].groupby(by=df['gender']).count()
gender
Female    2544
Male      2619
Name: Churn, dtype: int64
df["Churn"][df["Churn"]=='Yes'].groupby(by=df['gender']).count()
gender
Female    939
Male      930
Name: Churn, dtype: int64
plt.figure(figsize=(6,6))
labels=['Churn: Yes',"Churn: No"]
values=[1869,5163]
colors=["plum","wheat"]
labels_gender=["F","M","F","M"]
values_gender=[939,930,2544,2619]
colors_gender=['skyblue','lightpink','skyblue','lightpink']
explode=[0.3,0.3]
explode_gender=[0.1,0.1,0.1,0.1]
textprops={"fontsize":13}
textprops1={"fontsize":10}
plt.pie(
values,
labels=labels,
autopct='%1.1f%%',
pctdistance=1.08,
labeldistance=0.8,
colors=colors,
startangle=90,
frame=True,
explode=explode,
radius=10,
textprops=textprops,
counterclock=True
)
plt.pie(
values_gender,
labels=labels_gender,
autopct='%1.1f%%',
pctdistance=0.55,
labeldistance=0.82,
colors=colors_gender,
startangle=90,
explode=explode_gender,
radius=7,
textprops=textprops1,
counterclock=True
)
centre_circle=plt.Circle((0,0),5,color='black',fc='white',linewidth=0)
fig=plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title("Churn distribution w.r.t Gender: Male(M) , Female(F)",fontsize=11,y=1.1)
plt.axis('equal')
plt.tight_layout()
plt.legend(fontsize=8)
plt.show()
Both males and females have similar churn proportions: about 27.0% of female customers (939 of 3,483) and 26.2% of male customers (930 of 3,549) churned. This indicates that gender is not a major factor in whether customers leave the service.
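The per-group churn proportions can be computed directly instead of being typed in by hand. Below is a minimal sketch; the helper name `churn_rate_by` and the toy frame are hypothetical, and it assumes Churn still holds the strings "Yes" and "No", so it would need to run before the label encoding step.

```python
import pandas as pd

def churn_rate_by(frame, column, target="Churn", positive="Yes"):
    """Return, for each value of `column`, the share of rows where target == positive."""
    churned = frame[target].eq(positive)          # boolean Series: True for churned rows
    return churned.groupby(frame[column]).mean()  # mean of booleans = proportion of True

# Toy frame (assumed data) standing in for the churn DataFrame.
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male"],
    "Churn": ["Yes", "No", "No", "No"],
})

rates = churn_rate_by(toy, "gender")
print(rates)
```

The same helper could be applied to any categorical column, such as Contract or PaymentMethod, which makes it easier to compare rates across groups of different sizes than reading grouped count bars.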
fig=px.histogram(df, x='Churn', color='Contract', barmode='group', title="<b>Customer Contract Distribution</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
Month-to-month contracts show the highest churn rate, unlike one- or two-year contracts. Customers with long-term commitments tend to remain loyal to the service provider. Offering incentives for longer contracts can help minimize churn.
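payment_counts=df['PaymentMethod'].value_counts()
# Use the index of value_counts() for the labels so each label matches its count
labels=payment_counts.index
values=payment_counts.values
fig=go.Figure(data=[go.Pie(labels=labels,values=values,hole=0.3)])
fig.update_layout(width=700, title="<b>Payment Method Distribution</b>")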
fig.show()
fig=px.histogram(df, x='Churn', color='PaymentMethod', title="<b>Customer Payment Distribution w.r.t. Churn</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
Customers paying by electronic check show the highest churn. Those on automatic credit card or bank transfer payments churn less. This may reflect the convenience of automated payments and fewer billing issues.
df['InternetService'].unique()
array(['DSL', 'Fiber optic', 'No'], dtype=object)
df[df["gender"]=="Female"][["InternetService","Churn"]].value_counts()
InternetService  Churn
DSL              No       965
Fiber optic      No       889
No               No       690
Fiber optic      Yes      664
DSL              Yes      219
No               Yes       56
Name: count, dtype: int64
df[df["gender"]=="Male"][["InternetService","Churn"]].value_counts()
InternetService  Churn
DSL              No       992
Fiber optic      No       910
No               No       717
Fiber optic      Yes      633
DSL              Yes      240
No               Yes       57
Name: count, dtype: int64
# Bar heights are copied from the two value_counts() outputs above
fig=go.Figure()
fig.add_trace(go.Bar(
x=[['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
['Female', 'Male', 'Female', 'Male']],
y=[965,992,219,240],
name='DSL'
))
fig.add_trace(go.Bar(
x=[['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
['Female', 'Male', 'Female', 'Male']],
y=[889,910,664,633],
name='Fiber optic'
))
fig.add_trace(go.Bar(
x=[['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
['Female', 'Male', 'Female', 'Male']],
y=[690,717,56,57],
name='No Internet'
))
fig.update_layout(width=900, height=400, title="<b>Churn Distribution w.r.t. Internet Service and Gender</b>")
fig.show()
Fiber optic is the most common internet service (3,096 of 7,032 customers) and also has the highest churn rate: about 41.9% of Fiber optic customers churned, compared with about 19.0% of DSL customers (2,416 in total) and about 7.4% of customers without internet service. This might suggest dissatisfaction with the Fiber optic service.
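The churn rate per internet service follows from the counts printed by the two value_counts() calls above. The short calculation below works it through; the dictionary layout and the names `counts` and `churn_rate` are only for illustration.

```python
# Counts copied from the value_counts() outputs above, as (stayed, churned) pairs.
counts = {
    "DSL":         {"Female": (965, 219), "Male": (992, 240)},
    "Fiber optic": {"Female": (889, 664), "Male": (910, 633)},
    "No":          {"Female": (690, 56),  "Male": (717, 57)},
}

churn_rate = {}
for service, by_gender in counts.items():
    stayed = sum(pair[0] for pair in by_gender.values())
    churned = sum(pair[1] for pair in by_gender.values())
    # Churn rate = churned customers divided by all customers with this service.
    churn_rate[service] = churned / (stayed + churned)

for service, rate in churn_rate.items():
    print(f"{service}: {rate:.1%}")
```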
color={"Yes":"violet","No":"cyan"}
fig=px.histogram(
df,
x='Churn',
color='Dependents',
barmode='group',
title="<b>Dependents Distribution</b>",
color_discrete_map=color
)
fig.update_layout(
width=700,
height=500,
bargap=0.1
)
fig.show()
Customers with partners or dependents tend to stay longer and churn less frequently. This could be because family users rely more on stable communication services. Hence, single or independent customers are more at risk of leaving.
color={"Yes":"hotpink","No":"lightblue"}
fig=px.histogram(
df,
x="Churn",
color="Partner",
barmode="group",
title="<b>Churn Distribution w.r.t. Partner</b>",
color_discrete_map=color
)
fig.update_layout(
width=700,
height=500,
bargap=0.1
)
fig.show()
color={"Yes":"palegreen","No":"khaki"}
fig=px.histogram(
df,
x="Churn",
color="SeniorCitizen",
barmode="group",
title="<b>Churn Distribution w.r.t. Senior Citizen</b>",
color_discrete_map=color
)
fig.update_layout(
width=700,
height=500,
bargap=0.1
)
fig.show()
Senior citizens exhibit a noticeably higher churn rate than non-senior customers. This suggests that age or technology comfort level may influence customer satisfaction. Older customers may require more personalized support to improve retention.
color={"Yes":"deeppink","No":"darkviolet","No internet service":"lightgreen"}
fig=px.histogram(
df,
x="Churn",
color="OnlineSecurity",
barmode="group",
title="<b>Churn Distribution w.r.t. Online Security</b>",
color_discrete_map=color
)
fig.update_layout(
width=700,
height=500,
bargap=0.1
)
fig.show()
Customers lacking online security services churn more often than those who have them. It shows that value-added services like security increase satisfaction and retention. Encouraging such subscriptions could help in lowering churn rates.
color={"Yes":"deeppink","No":"darkviolet","No phone service":"lightgreen"}
fig=px.histogram(
df,
x="Churn",
color="MultipleLines",
barmode="group",
title="<b>Churn Distribution w.r.t. MultipleLines</b>",
color_discrete_map=color
)
fig.update_layout(
width=700,
height=500,
bargap=0.1
)
fig.show()
Customers with multiple lines or bundled services tend to churn less. This indicates that bundled or multi-service offerings increase stickiness. Cross-selling more services can be an effective churn reduction strategy.
color={"Yes":"black","No":"red"}
fig=px.histogram(
df,
x="Churn",
color="PaperlessBilling",
barmode="group",
title="<b>Churn Distribution w.r.t. PaperlessBilling</b>",
color_discrete_map=color
)
fig.update_layout(
width=700,
height=500,
bargap=0.1
)
fig.show()
Paperless billing users show higher churn rates than those using mailed bills. Such customers may be more digitally active and more likely to switch services. Targeted offers or engagement strategies for digital users could help retention.
fig=px.scatter(
df,
x="tenure",
y="MonthlyCharges",
color="Churn",
title="<b>Tenure vs Monthly Charges by Churn</b>"
)
fig.update_layout(
width=800,
height=600
)
fig.show()
fig=px.scatter(
df,
x="tenure",
y="MonthlyCharges",
size="TotalCharges",
color="Churn",
title="<b>Customer Value by Tenure and Charges</b>"
)
fig.update_layout(
width=800,
height=700
)
fig.show()
fig=px.box(
df,
x="Contract",
y="MonthlyCharges",
color="Churn",
title="<b>Monthly Charges by Contract Type and Churn</b>"
)
fig.update_layout(
height=600
)
fig.show()
fig=px.violin(
df,
x="InternetService",
y="tenure",
color="Churn",
title="<b>Tenure Distribution by Internet Service and Churn</b>"
)
fig.update_layout(
height=600,
)
fig.show()
corr=df.corr(numeric_only=True)
fig=px.imshow(
corr,
text_auto=True,
color_continuous_scale="PiYG",
title="Correlation Heatmap"
)
fig.update_layout(
width=700,
height=500
)
fig.show()
sns.pairplot(df,vars=["tenure","MonthlyCharges","TotalCharges"],hue="Churn")
Encoding the categorical columns
# Encode every text column as integer codes; the encoder is re-fitted for each column
le=LabelEncoder()
for col in df.select_dtypes("object").columns:
    df[col]=le.fit_transform(df[col])
df
| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 29.85 | 29.85 | 0 |
| 1 | 1 | 0 | 0 | 0 | 34 | 1 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 56.95 | 1889.50 | 0 |
| 2 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 53.85 | 108.15 | 1 |
| 3 | 1 | 0 | 0 | 0 | 45 | 0 | 1 | 0 | 2 | 0 | 2 | 2 | 0 | 0 | 1 | 0 | 0 | 42.30 | 1840.75 | 0 |
| 4 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 70.70 | 151.65 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7038 | 1 | 0 | 1 | 1 | 24 | 1 | 2 | 0 | 2 | 0 | 2 | 2 | 2 | 2 | 1 | 1 | 3 | 84.80 | 1990.50 | 0 |
| 7039 | 0 | 0 | 1 | 1 | 72 | 1 | 2 | 1 | 0 | 2 | 2 | 0 | 2 | 2 | 1 | 1 | 1 | 103.20 | 7362.90 | 0 |
| 7040 | 0 | 0 | 1 | 1 | 11 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 29.60 | 346.45 | 0 |
| 7041 | 1 | 1 | 1 | 0 | 4 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 74.40 | 306.60 | 1 |
| 7042 | 1 | 0 | 0 | 0 | 66 | 1 | 0 | 1 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 105.65 | 6844.50 | 0 |
7032 rows × 20 columns
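Because a single LabelEncoder is re-fitted for every column, the mapping from codes back to the original labels is lost after the loop. One possible variation, sketched below on a toy frame, keeps one encoder per column so the codes can be decoded later. The `encoders` dictionary and the toy data are assumptions for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (assumed data) with two categorical columns.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year"],
    "Churn": ["Yes", "No", "No"],
})

encoders = {}          # one fitted encoder per column, kept for later decoding
encoded = toy.copy()
for col in ["Contract", "Churn"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(encoded[col])

# LabelEncoder assigns codes in sorted label order; classes_ lists that order.
mapping = {label: code for code, label in enumerate(encoders["Contract"].classes_)}
print(mapping)

# Decode integer codes back into the original labels.
print(list(encoders["Churn"].inverse_transform([1, 0])))
```

Keeping the encoders also makes it clear which integer stands for which category when reading the encoded table or interpreting model output.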
x=df.drop('Churn',axis=1)
y=df['Churn']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
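Since only about a quarter of the customers churned, a random split can leave the train and test sets with slightly different churn shares. As an optional variation (not what this notebook does, and it would change the scores reported below), train_test_split accepts a stratify argument that preserves the class proportions. The sketch uses synthetic arrays `X` and `y` as stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed): 1000 rows, 25% in the positive class.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 250 + [0] * 750)

# stratify=y keeps the share of each class the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())  # positive-class share in each part
```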
Model Training
dt=DecisionTreeClassifier(criterion="entropy",random_state=42)
dt.fit(x_train,y_train)
dt_preds=dt.predict(x_test)
rf=RandomForestClassifier(random_state=42)
rf.fit(x_train,y_train)
rf_preds = rf.predict(x_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, dt_preds))
print("Random Forest Accuracy:", accuracy_score(y_test, rf_preds))
Decision Tree Accuracy: 0.7448471926083866
Random Forest Accuracy: 0.7924662402274343
Random Forest reaches higher accuracy than the Decision Tree on the test set (about 79.2% versus 74.5%).
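Accuracy is easier to judge against a simple reference point. The test set contains 1,033 non-churners and 374 churners (the class counts that appear in this notebook's confusion matrices and classification reports), so a model that always predicts "no churn" already gets most rows right. The short calculation below compares both models with that baseline; the variable names are illustrative.

```python
# Test-set class counts, taken from the notebook's confusion matrices / report support.
n_stay, n_churn = 1033, 374

# Accuracy of always predicting the majority class.
baseline_accuracy = max(n_stay, n_churn) / (n_stay + n_churn)

# Accuracies printed by the notebook above.
dt_accuracy = 0.7448471926083866
rf_accuracy = 0.7924662402274343

print(f"baseline:      {baseline_accuracy:.3f}")
print(f"decision tree: {dt_accuracy:.3f} ({dt_accuracy - baseline_accuracy:+.3f})")
print(f"random forest: {rf_accuracy:.3f} ({rf_accuracy - baseline_accuracy:+.3f})")
```

Seen this way, the accuracy gain over the baseline is modest for both models, which is one reason the per-class precision and recall matter for a churn problem.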
print("DT Confusion Matrix:\n", confusion_matrix(y_test, dt_preds))
print("RF Confusion Matrix:\n", confusion_matrix(y_test, rf_preds))
DT Confusion Matrix:
 [[841 192]
 [167 207]]
RF Confusion Matrix:
 [[932 101]
 [191 183]]
Confusion Matrix
dt_conf=confusion_matrix(y_test, dt_preds)
sns.heatmap(
dt_conf,
annot=True,
fmt='d',
cmap='Blues',
xticklabels=['No Churn','Churn'],
yticklabels=['No Churn','Churn'],
)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix of Decision Tree')
plt.show()
rf_conf=confusion_matrix(y_test, rf_preds)
sns.heatmap(
rf_conf,
annot=True,
fmt='d',
cmap='magma',
xticklabels=['No Churn','Churn'],
yticklabels=['No Churn','Churn'],
)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix of Random Forest')
plt.show()
Classification Reports
dt_report=classification_report(y_test,dt_preds)
print(dt_report)
precision recall f1-score support
0 0.83 0.81 0.82 1033
1 0.52 0.55 0.54 374
accuracy 0.74 1407
macro avg 0.68 0.68 0.68 1407
weighted avg 0.75 0.74 0.75 1407
rf_report=classification_report(y_test,rf_preds)
print(rf_report)
precision recall f1-score support
0 0.83 0.90 0.86 1033
1 0.64 0.49 0.56 374
accuracy 0.79 1407
macro avg 0.74 0.70 0.71 1407
weighted avg 0.78 0.79 0.78 1407
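The churn-class figures in both reports can be reproduced from the confusion matrices printed earlier, which shows where each number comes from. In the sketch below, `churn_metrics` is a hypothetical helper; rows are actual classes and columns are predicted classes, in the order [No Churn, Churn].

```python
def churn_metrics(matrix):
    """Precision, recall and F1 for the positive (churn) class of a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)  # of customers predicted to churn, how many did
    recall = tp / (tp + fn)     # of customers who churned, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion matrices copied from the notebook output above.
dt_matrix = [[841, 192], [167, 207]]
rf_matrix = [[932, 101], [191, 183]]

for name, matrix in [("Decision Tree", dt_matrix), ("Random Forest", rf_matrix)]:
    precision, recall, f1 = churn_metrics(matrix)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Random Forest has the higher accuracy and churn precision (0.64 versus 0.52), but the Decision Tree identifies a larger share of the customers who actually churned (recall 0.55 versus 0.49). Which model is preferable depends on whether missing a churner or contacting a loyal customer is the costlier mistake.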